[NVPTX] Add im2colw/w128 modes support to TMA intrinsics #148863

durga4github · 2025-07-15T15:08:43Z

This patch adds support for the im2col-w/w128 and scatter/gather modes
for TMA Copy and Prefetch intrinsics, completing support for all the
available modes. These are lowered through tablegen, building
on top of earlier patches.

lit tests are added for all the combinations and verified with a
12.8 ptxas executable.
Documentation is updated in the NVPTXUsage.rst file.

llvmbot · 2025-07-15T15:09:19Z

@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-backend-nvptx

Author: Durgadoss R (durga4github)

Changes

This patch adds support for the im2col-w/w128 and scatter/gather modes
for TMA Copy and Prefetch intrinsics, completing support for all the
available modes.

lit tests are added for all the combinations and verified with a
12.8 ptxas executable.
Documentation is updated in the NVPTXUsage.rst file.

Patch is 226.88 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/148863.diff

12 Files Affected:

(modified) llvm/docs/NVPTXUsage.rst (+152-9)
(modified) llvm/include/llvm/IR/IntrinsicsNVVM.td (+69-9)
(modified) llvm/lib/Target/NVPTX/NVPTXInstrInfo.td (+1)
(modified) llvm/lib/Target/NVPTX/NVPTXIntrinsics.td (+158-9)
(added) llvm/test/CodeGen/NVPTX/cp-async-bulk-tensor-g2s-cta-sm100.ll (+193)
(added) llvm/test/CodeGen/NVPTX/cp-async-bulk-tensor-g2s-cta-sm100a.ll (+150)
(added) llvm/test/CodeGen/NVPTX/cp-async-bulk-tensor-g2s-cta-sm90.ll (+351)
(added) llvm/test/CodeGen/NVPTX/cp-async-bulk-tensor-g2s-gather4.ll (+174)
(added) llvm/test/CodeGen/NVPTX/cp-async-bulk-tensor-g2s-im2colw.ll (+524)
(added) llvm/test/CodeGen/NVPTX/cp-async-bulk-tensor-g2s-im2colw128.ll (+524)
(added) llvm/test/CodeGen/NVPTX/cp-async-bulk-tensor-prefetch-sm100a.ll (+171)
(added) llvm/test/CodeGen/NVPTX/cp-async-bulk-tensor-s2g-scatter4.ll (+52)

diff --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst
index 11017fe4e01b4..d28eb6860c33a 100644
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -1072,6 +1072,8 @@ Syntax:
   declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
   declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
 
+  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.gather4.2d(ptr addrspace(7) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %x0, i32 %y0, i32 %y1, i32 %y2, i32 %y3, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch, i32 %flag_cta_group)
+
 Overview:
 """""""""
 
@@ -1082,7 +1084,13 @@ global memory to shared::cluster memory (indicated by the ``g2s`` prefix)
 in ``tile`` mode. In tile mode, the multi-dimensional layout of the
 source tensor is preserved at the destination. The dimension of the
 tensor data ranges from 1d to 5d with the coordinates specified
-by the ``i32 %d0 ... i32 %d4`` arguments.
+by the ``i32 %d0 ... i32 %d4`` arguments. In ``tile.gather4`` mode,
+four rows in a 2D tensor are combined to form a single 2D destination
+tensor. The first coordinate ``i32 %x0`` denotes the column index
+followed by four coordinates indicating the four row-indices.
+So, this mode takes a total of 5 coordinates as input arguments.
+For more information on ``gather4`` mode, refer PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-tiled-scatter4-gather4-modes>`_.
 
 * The last three arguments to these intrinsics are flags
   indicating support for multicast, cache_hint and cta_group::1/2
@@ -1116,10 +1124,18 @@ Syntax:
 
 .. code-block:: llvm
 
-  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.3d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %im2col0, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch, i32 %flag_cta_group)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.3d(ptr addrspace(7) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %im2col0, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch, i32 %flag_cta_group)
   declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i16 %im2col0, i16 %im2col1, ...)
   declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i16 %im2col0, i16 %im2col1, i16 %im2col2, ...)
 
+  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.w.3d(ptr addrspace(7) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %wHalo, i16 %wOffset, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch, i32 %flag_cta_group)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.w.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.w.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
+  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.w.128.3d(ptr addrspace(7) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %wHalo, i16 %wOffset, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch, i32 %flag_cta_group)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.w.128.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.w.128.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
 Overview:
 """""""""
 
@@ -1131,10 +1147,105 @@ in ``im2col`` mode. In im2col mode, some dimensions of the source tensor
 are unrolled into a single dimensional column at the destination. In this
 mode, the tensor has to be at least three-dimensional. Along with the tensor
 coordinates, im2col offsets are also specified (denoted by
-``i16 im2col0...i16 %im2col2``). The number of im2col offsets is two less
-than the number of dimensions of the tensor operation. The last three arguments
-to these intrinsics are flags, with the same functionality as described
-in the ``tile`` mode intrinsics above.
+``i16 im2col0...i16 %im2col2``). For the ``im2col`` mode, the number of offsets
+is two less than the number of dimensions of the tensor operation. For the
+``im2col.w`` and ``im2col.w.128`` mode, the number of offsets is always 2,
+denoted by ``i16 %wHalo`` and ``i16 %wOffset`` arguments. For more information
+on ``im2col.w`` and ``im2col.w.128`` modes, refer PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-im2col-w-w128-modes>`_.
+
+The last three arguments to these intrinsics are flags, with the same functionality
+as described in the ``tile`` mode intrinsics above.
+
+For more information, refer PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`_.
+
+'``llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.[1-5]d``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.1d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.2d(..., i32 %d0, i32 %d1, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
+  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.gather4.2d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %x0, i32 %y0, i32 %y1, i32 %y2, i32 %y3, i64 %ch, i1 %flag_ch)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.[1-5]d``' intrinsics
+correspond to the ``cp.async.bulk.tensor.[1-5]d.shared::cta.global.*``
+set of PTX instructions. These instructions initiate an asynchronous
+copy of tensor data from global memory to shared::cta memory in
+``tile`` mode. In tile mode, the multi-dimensional layout of the
+source tensor is preserved at the destination. The dimension of the
+tensor data ranges from 1d to 5d with the coordinates specified
+by the ``i32 %d0 ... i32 %d4`` arguments. In ``tile.gather4`` mode,
+four rows in a 2D tensor are combined to form a single 2D destination
+tensor. The first coordinate ``i32 %x0`` denotes the column index
+followed by four coordinates indicating the four row-indices.
+So, this mode takes a total of 5 coordinates as input arguments.
+For more information on ``gather4`` mode, refer PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-tiled-scatter4-gather4-modes>`_.
+
+* The last argument to these intrinsics is a boolean flag
+  indicating support for cache_hint. This flag argument must
+  be a compile-time constant. When set, it indicates a valid
+  cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint``
+  variant of the PTX instruction.
+
+For more information, refer PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`_.
+
+'``llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.[3-5]d``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.3d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %im2col0, i64 %ch, i1 %flag_ch)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i16 %im2col0, i16 %im2col1, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i16 %im2col0, i16 %im2col1, i16 %im2col2, ...)
+
+  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.w.3d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %wHalo, i16 %wOffset, i64 %ch, i1 %flag_ch)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.w.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.w.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
+  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.w.128.3d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %wHalo, i16 %wOffset, i64 %ch, i1 %flag_ch)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.w.128.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.w.128.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.[3-5]d``' intrinsics
+correspond to the ``cp.async.bulk.tensor.[1-5]d.shared::cta.global.*``
+set of PTX instructions. These instructions initiate an asynchronous copy
+of tensor data from global memory to shared::cta memory in ``im2col`` mode.
+In im2col mode, some dimensions of the source tensor are unrolled into a
+single dimensional column at the destination. In this mode, the tensor has
+to be at least three-dimensional. Along with the tensor coordinates, im2col
+offsets are also specified (denoted by ``i16 im2col0...i16 %im2col2``).
+For the ``im2col`` mode, the number of offsets is two less than the number
+of dimensions of the tensor operation. For the ``im2col.w`` and ``im2col.w.128``
+mode, the number of offsets is always 2, denoted by ``i16 %wHalo`` and
+``i16 %wOffset`` arguments. For more information on ``im2col.w`` and
+``im2col.w.128`` modes, refer PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-im2col-w-w128-modes>`_.
+
+* The last argument to these intrinsics is a boolean flag
+  indicating support for cache_hint. This flag argument must
+  be a compile-time constant. When set, it indicates a valid
+  cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint``
+  variant of the PTX instruction.
 
 For more information, refer PTX ISA
 `<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`_.
@@ -1153,6 +1264,8 @@ Syntax:
   declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
   declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
 
+  declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.scatter4.2d(ptr addrspace(3) %src, ptr %tensor_map, i32 %x0, i32 %y0, i32 %y1, i32 %y2, i32 %y3, i64 %ch, i1 %flag_ch)
+
 Overview:
 """""""""
 
@@ -1162,6 +1275,12 @@ These instructions initiate an asynchronous copy of tensor data from
 shared::cta to global memory (indicated by the ``s2g`` prefix)
 in ``tile`` mode. The dimension of the tensor data ranges from 1d to 5d
 with the coordinates specified by the ``i32 %d0 ... i32 %d4`` arguments.
+In ``tile.scatter4`` mode, a single 2D source tensor is divided into
+four rows in the 2D destination tensor. The first coordinate ``i32 %x0``
+denotes the column index followed by four coordinates indicating the
+four row-indices. So, this mode takes a total of 5 coordinates as input arguments.
+For more information on ``scatter4`` mode, refer PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-tiled-scatter4-gather4-modes>`_.
 
 * The last argument to these intrinsics is a boolean flag
   indicating support for cache_hint. This flag argument must
@@ -1214,6 +1333,8 @@ Syntax:
   declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
   declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
 
+  declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.gather4.2d(ptr %tensor_map, i32 %x0, i32 %y0, i32 %y1, i32 %y2, i32 %y3, i64 %ch, i1 %flag_ch)
+
 Overview:
 """""""""
 
@@ -1225,6 +1346,13 @@ multi-dimensional layout of the source tensor is preserved at the destination.
 The dimension of the tensor data ranges from 1d to 5d with the coordinates
 specified by the ``i32 %d0 ... i32 %d4`` arguments.
 
+In ``tile.gather4`` mode, four rows in the 2-dimnesional source tensor are
+fetched to the L2 cache. The first coordinate ``i32 %x0`` denotes the column index
+followed by four coordinates indicating the four row-indices. So, this mode takes
+a total of 5 coordinates as input arguments.
+For more information on ``gather4`` mode, refer PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-tiled-scatter4-gather4-modes>`_.
+
 * The last argument to these intrinsics is a boolean flag
   indicating support for cache_hint. This flag argument must
   be a compile-time constant. When set, it indicates a valid
@@ -1246,6 +1374,14 @@ Syntax:
   declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i16 %im2col0, i16 %im2col1, ...)
   declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i16 %im2col0, i16 %im2col1, i16 %im2col2, ...)
 
+  declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.w.3d(ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %wHalo, i16 %wOffset, i64 %ch, i1 %flag_ch)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.w.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.w.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
+  declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.w.128.3d(ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %wHalo, i16 %wOffset, i64 %ch, i1 %flag_ch)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.w.128.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.w.128.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
 Overview:
 """""""""
 
@@ -1256,9 +1392,16 @@ of tensor data from global memory to the L2 cache. In im2col mode, some
 dimensions of the source tensor are unrolled into a single dimensional
 column at the destination. In this mode, the tensor has to be at least
 three-dimensional. Along with the tensor coordinates, im2col offsets are
-also specified (denoted by ``i16 im2col0...i16 %im2col2``). The number
-of im2col offsets is two less than the number of dimensions of the tensor
-operation. The last argument to these intrinsics is a boolean flag, with
+also specified (denoted by ``i16 im2col0...i16 %im2col2``). For ``im2col``
+mode, the number of offsets is two less than the number of dimensions of
+the tensor operation. For the ``im2col.w`` and ``im2col.w.128`` modes,
+the number of offsets is always 2, denoted by ``i16 %wHalo`` and
+``i16 %wOffset`` arguments. For more information on ``im2col.w`` and
+``im2col.w.128`` modes, refer PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-im2col-w-w128-modes>`_.
+
+
+The last argument to these intrinsics is a boolean flag, with
 the same functionality as described in the ``tile`` mode intrinsics above.
 
 For more information, refer PTX ISA
diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index 0375f29ad8906..5ddc14445908b 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -2024,9 +2024,7 @@ foreach dim = 1...5 in {
                       tensor_dim_args,      // actual tensor dims
                       [llvm_i64_ty]),       // cache_hint
           [llvm_i1_ty],                     // Flag for cache_hint
-          [IntrConvergent,
-           ReadOnly<ArgIndex<0>>, ReadOnly<ArgIndex<1>>,
-           NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>]>;
+          [IntrConvergent, ReadOnly<ArgIndex<0>>, ReadOnly<ArgIndex<1>>]>;
 
     // Intrinsics for TMA Copy with reduction
     foreach red_op = ["add", "min", "max", "inc", "dec", "and", "or", "xor"] in
@@ -2037,18 +2035,31 @@ foreach dim = 1...5 in {
                          tensor_dim_args,     // actual tensor dims
                         [llvm_i64_ty]),       // cache_hint
           [llvm_i1_ty],                       // Flag for cache_hint
-          [IntrConvergent, ReadOnly<ArgIndex<0>>, ReadOnly<ArgIndex<1>>,
-           NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>]>;
+          [IntrConvergent, ReadOnly<ArgIndex<0>>, ReadOnly<ArgIndex<1>>]>;
   }
 }
 
+// TMA S2G tile::scatter4
+def int_nvvm_cp_async_bulk_tensor_s2g_tile_scatter4_2d
+  : DefaultAttrsIntrinsicFlags<[],
+      !listconcat([llvm_shared_ptr_ty,        // src_smem_ptr
+                   llvm_ptr_ty],              // tensormap_ptr
+                  !listsplat(llvm_i32_ty, 5), // dims
+                  [llvm_i64_ty]),             // cache_hint
+      [llvm_i1_ty],                           // Flag for cache_hint
+      [IntrConvergent, ReadOnly<ArgIndex<0>>, ReadOnly<ArgIndex<1>>]>;
+
 // TMA Tensor Copy Intrinsics: G2S -> From Global to Shared memory variants
 foreach dim = 1...5 in {
   defvar tensor_dim_args = !listsplat(llvm_i32_ty, dim);
 
-  foreach mode = !if(!ge(dim, 3), ["tile", "im2col"], ["tile"]) in {
+  foreach mode = !if(!ge(dim, 3), ["tile", "im2col", "im2col_w", "im2col_w_128"], ["tile"]) in {
     defvar is_im2col = !eq(mode, "im2col");
-    defvar num_im2col_offsets = !if(is_im2col, !add(dim, -2), 0);
+    defvar is_im2colw = !or(!eq(mode, "im2col_w"), !eq(mode, "im2col_w_128"));
+
+    // For im2col_w/w128 modes, the num_offsets is always 2.
+    // For im2col mode, the num_offsets is (dim - 2).
+    defvar num_im2col_offsets = !if(is_im2colw, 2, !if(is_im2col, !add(dim, -2), 0));
     defvar im2col_offsets_args = !listsplat(llvm_i16_ty, num_im2col_offsets);
 
     defvar g2s_params = !listconcat(
@@ -2079,11 +2090,60 @@ foreach dim = 1...5 in {
                        im2col_offsets_args, // im2col offsets
                       [llvm_i64_ty]),       // cache_hint
           [llvm_i1_ty],                     // Flag for cache_hint
-          [IntrConvergent,
-           ReadOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>]>;
+          [IntrConvergent, ReadOnly<ArgIndex<0>>]>;
+
+    def int_nvvm_cp_async_bulk_tensor_g2s_cta_ # mode # _ # dim # d :
+      DefaultAttrsIntrinsicFlags<[],
+          !listconcat([llvm_shared_ptr_ty,  // dst_ptr
+                       llvm_shared_ptr_ty,  // mbarrier_ptr
+                       llvm_ptr_ty],        // tensormap_ptr
+                       tensor_dim_args,     // actual tensor dims
+                       im2col_offsets_args, // im2col offsets
+                       [llvm_i64_ty]),      // cache_hint
+          [llvm_i1_ty],                     // Flag for cache_hint
+          [IntrConvergent, WriteOnly<ArgIndex<0>>, ReadOnly<ArgIndex<2>>]>;
   }
 }
 
+// TMA copy for tile::gather4
+def int_nvvm_cp_async_bulk_tensor_g2s_tile_gather4_2d
+  : DefaultAttrsIntrinsicFlags<[],
+      !listconcat(
+      [llvm_shared_cluster_ptr_ty, // dst_shared_cluster_ptr
+       llvm_shared_ptr_ty,         // mbarrier_ptr
+       llvm_ptr_ty],               // tensormap_ptr
+       !listsplat(llvm_i32_ty, 5), // co-ordinates
+      [llvm_i16_ty,                // cta_mask
+       llvm_i64_ty]),              // cache_hint
+      [llvm_i1_ty,                 // Flag for cta_mask
+       llvm_i1_ty,                 // Flag for cache_hint
+       llvm_i32_ty],               // Flag for cta_group
+      [IntrConvergent,
+       WriteOnly<ArgIndex<0>>, ReadOnly<ArgIndex<2>>,
+       // Allowed values for cta_group are {0,1,2} i.e [0, 3).
+       Range<ArgIndex<12>, 0, 3>]>;
+
+def int_nvvm_cp_async_bulk_tensor_g2s_cta_tile_gather4_2d
+  : DefaultAttrsIntrinsicFlags<[],
+      !listconcat(
+      [llvm_shared_ptr_ty,         // dst_shared_ptr
+       llvm_shared_ptr_ty,         // mbarrier_ptr
+       llvm_ptr_ty],               // tensormap_ptr
+       !listsplat(llvm_i32_ty, 5), // co-ordinates
+      [llvm_i64_ty]),              // cache_hint
+      [llvm_i1_ty],                // Flag for cache_hint
+      [IntrConvergent,
+       WriteOnly<ArgIndex<0>>, ReadOnly<ArgIndex<2>>]>;
+
+// TMA prefetch for tile::gather4
+def int_nvvm_cp_async_bulk_tensor_prefetch_tile_gather4_2d
+  : DefaultAttrsIntrinsicFlags<[],
+      !listconcat([llvm_ptr...
[truncated]

durga4github · 2025-07-17T13:27:42Z

Ping,
@Artem-B , Could you please help with a review here?

Artem-B

LGTM overall, with few nits.

llvm/lib/Target/NVPTX/NVPTXIntrinsics.td

llvm/test/CodeGen/NVPTX/cp-async-bulk-tensor-g2s-cta-sm100.ll

This patch adds support for the following modes in the TMA intrinsics: * TMA G2S Copy: im2col_w, im2col_w_128 and tile_gather4_2d. * TMA Prefetch: im2col_w, im2col_w_128 and tile_gather4_2d. * TMA S2G Copy: tile_scatter4_2d mode * TMA G2S-CTA copy: tile, im2col, im2col_w, im2col_w_128 and tile_gather4_2d. Signed-off-by: Durgadoss R <[email protected]>

llvm-ci · 2025-07-19T12:14:14Z

LLVM Buildbot has detected a new failure on builder arc-builder running on arc-worker while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/3/builds/19252

Here is the relevant piece of the build log for the reference

Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM :: CodeGen/X86/sse2-intrinsics-fast-isel.ll' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
/buildbot/worker/arc-folder/build/bin/llc < /buildbot/worker/arc-folder/llvm-project/llvm/test/CodeGen/X86/sse2-intrinsics-fast-isel.ll -show-mc-encoding -fast-isel -mtriple=i386-unknown-unknown -mattr=+sse2 | /buildbot/worker/arc-folder/build/bin/FileCheck /buildbot/worker/arc-folder/llvm-project/llvm/test/CodeGen/X86/sse2-intrinsics-fast-isel.ll --check-prefixes=CHECK,X86,SSE,X86-SSE # RUN: at line 2
+ /buildbot/worker/arc-folder/build/bin/llc -show-mc-encoding -fast-isel -mtriple=i386-unknown-unknown -mattr=+sse2
+ /buildbot/worker/arc-folder/build/bin/FileCheck /buildbot/worker/arc-folder/llvm-project/llvm/test/CodeGen/X86/sse2-intrinsics-fast-isel.ll --check-prefixes=CHECK,X86,SSE,X86-SSE
LLVM ERROR: Cannot select: intrinsic %llvm.x86.sse2.lfence
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.	Program arguments: /buildbot/worker/arc-folder/build/bin/llc -show-mc-encoding -fast-isel -mtriple=i386-unknown-unknown -mattr=+sse2
1.	Running pass 'Function Pass Manager' on module '<stdin>'.
2.	Running pass 'X86 DAG->DAG Instruction Selection' on function '@test_mm_lfence'
 #0 0x00000000023210e8 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/buildbot/worker/arc-folder/build/bin/llc+0x23210e8)
 #1 0x000000000231dff5 SignalHandler(int, siginfo_t*, void*) Signals.cpp:0:0
 #2 0x00007fa336c4a630 __restore_rt sigaction.c:0:0
 #3 0x00007fa33599a3d7 raise (/usr/lib64/libc.so.6+0x363d7)
 #4 0x00007fa33599bac8 abort (/usr/lib64/libc.so.6+0x37ac8)
 #5 0x000000000071ae1d llvm::json::operator==(llvm::json::Value const&, llvm::json::Value const&) (.cold) JSON.cpp:0:0
 #6 0x00000000020b0f59 llvm::SelectionDAGISel::CannotYetSelect(llvm::SDNode*) (/buildbot/worker/arc-folder/build/bin/llc+0x20b0f59)
 #7 0x00000000020b5b2a llvm::SelectionDAGISel::SelectCodeCommon(llvm::SDNode*, unsigned char const*, unsigned int) (/buildbot/worker/arc-folder/build/bin/llc+0x20b5b2a)
 #8 0x00000000009594e7 (anonymous namespace)::X86DAGToDAGISel::Select(llvm::SDNode*) X86ISelDAGToDAG.cpp:0:0
 #9 0x00000000020ac81f llvm::SelectionDAGISel::DoInstructionSelection() (/buildbot/worker/arc-folder/build/bin/llc+0x20ac81f)
#10 0x00000000020bc618 llvm::SelectionDAGISel::CodeGenAndEmitDAG() (/buildbot/worker/arc-folder/build/bin/llc+0x20bc618)
#11 0x00000000020c028a llvm::SelectionDAGISel::SelectAllBasicBlocks(llvm::Function const&) (/buildbot/worker/arc-folder/build/bin/llc+0x20c028a)
#12 0x00000000020c0ee5 llvm::SelectionDAGISel::runOnMachineFunction(llvm::MachineFunction&) (/buildbot/worker/arc-folder/build/bin/llc+0x20c0ee5)
#13 0x00000000020ac03f llvm::SelectionDAGISelLegacy::runOnMachineFunction(llvm::MachineFunction&) (/buildbot/worker/arc-folder/build/bin/llc+0x20ac03f)
#14 0x0000000001201527 llvm::MachineFunctionPass::runOnFunction(llvm::Function&) (.part.0) MachineFunctionPass.cpp:0:0
#15 0x000000000185dfa2 llvm::FPPassManager::runOnFunction(llvm::Function&) (/buildbot/worker/arc-folder/build/bin/llc+0x185dfa2)
#16 0x000000000185e341 llvm::FPPassManager::runOnModule(llvm::Module&) (/buildbot/worker/arc-folder/build/bin/llc+0x185e341)
#17 0x000000000185ef57 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/buildbot/worker/arc-folder/build/bin/llc+0x185ef57)
#18 0x00000000007f8382 compileModule(char**, llvm::LLVMContext&) llc.cpp:0:0
#19 0x0000000000723336 main (/buildbot/worker/arc-folder/build/bin/llc+0x723336)
#20 0x00007fa335986555 __libc_start_main (/usr/lib64/libc.so.6+0x22555)
#21 0x00000000007ee5c6 _start (/buildbot/worker/arc-folder/build/bin/llc+0x7ee5c6)
/buildbot/worker/arc-folder/llvm-project/llvm/test/CodeGen/X86/sse2-intrinsics-fast-isel.ll:473:13: error: X86-NEXT: expected string not found in input
; X86-NEXT: movl {{[0-9]+}}(%esp), %eax # encoding: [0x8b,0x44,0x24,0x04]
            ^
<stdin>:247:9: note: scanning from here
# %bb.0:
        ^
<stdin>:248:2: note: possible intended match here
 pause # encoding: [0xf3,0x90]
 ^
/buildbot/worker/arc-folder/llvm-project/llvm/test/CodeGen/X86/sse2-intrinsics-fast-isel.ll:1862:14: error: SSE-LABEL: expected string not found in input
; SSE-LABEL: test_mm_extract_epi16:
             ^
<stdin>:870:16: note: scanning from here
test_mm_div_sd: # @test_mm_div_sd
...

llvm-ci · 2025-07-19T13:30:14Z

LLVM Buildbot has detected a new failure on builder clang-x64-windows-msvc running on windows-gcebot2 while building llvm at step 4 "annotate".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/63/builds/7950

Here is the relevant piece of the build log for the reference

Step 4 (annotate) failure: 'python ../llvm-zorg/zorg/buildbot/builders/annotated/clang-windows.py ...' (failure)
...
[321/323] Running the Clang regression tests
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\config.py:57: note: using lit tools: C:\Program Files\Git\usr\bin
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\config.py:520: note: using clang: c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\clang.exe
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\subst.py:126: note: Did not find cir-opt in C:\b\slave\clang-x64-windows-msvc\build\stage1\bin;C:\b\slave\clang-x64-windows-msvc\build\stage1\bin
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\config.py:520: note: using ld.lld: c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\ld.lld.exe
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\config.py:520: note: using lld-link: c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\lld-link.exe
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\config.py:520: note: using ld64.lld: c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\ld64.lld.exe
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\config.py:520: note: using wasm-ld: c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\wasm-ld.exe
-- Testing: 22678 tests, 32 workers --
Testing:  0.. 10.. 20.. 30.. 40
FAIL: Clang :: CodeGen/SystemZ/builtins-systemz-zvector.c (790 of 22678)
******************** TEST 'Clang :: CodeGen/SystemZ/builtins-systemz-zvector.c' FAILED ********************
Exit Code: 2

Command Output (stdout):
--
# RUN: at line 2
c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\clang.exe -cc1 -internal-isystem C:\b\slave\clang-x64-windows-msvc\build\stage1\lib\clang\22\include -nostdsysteminc -target-cpu z13 -triple s390x-linux-gnu  -O2 -fzvector -flax-vector-conversions=none  -Wall -Wno-unused -Werror -emit-llvm C:\b\slave\clang-x64-windows-msvc\llvm-project\clang\test\CodeGen\SystemZ\builtins-systemz-zvector.c -o - | c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\filecheck.exe C:\b\slave\clang-x64-windows-msvc\llvm-project\clang\test\CodeGen\SystemZ\builtins-systemz-zvector.c
# executed command: 'c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\clang.exe' -cc1 -internal-isystem 'C:\b\slave\clang-x64-windows-msvc\build\stage1\lib\clang\22\include' -nostdsysteminc -target-cpu z13 -triple s390x-linux-gnu -O2 -fzvector -flax-vector-conversions=none -Wall -Wno-unused -Werror -emit-llvm 'C:\b\slave\clang-x64-windows-msvc\llvm-project\clang\test\CodeGen\SystemZ\builtins-systemz-zvector.c' -o -
# executed command: 'c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\filecheck.exe' 'C:\b\slave\clang-x64-windows-msvc\llvm-project\clang\test\CodeGen\SystemZ\builtins-systemz-zvector.c'
# RUN: at line 5
c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\clang.exe -cc1 -internal-isystem C:\b\slave\clang-x64-windows-msvc\build\stage1\lib\clang\22\include -nostdsysteminc -target-cpu z13 -triple s390x-linux-gnu  -O2 -fzvector -flax-vector-conversions=none  -Wall -Wno-unused -Werror -S C:\b\slave\clang-x64-windows-msvc\llvm-project\clang\test\CodeGen\SystemZ\builtins-systemz-zvector.c -o - | c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\filecheck.exe C:\b\slave\clang-x64-windows-msvc\llvm-project\clang\test\CodeGen\SystemZ\builtins-systemz-zvector.c --check-prefix=CHECK-ASM
# executed command: 'c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\clang.exe' -cc1 -internal-isystem 'C:\b\slave\clang-x64-windows-msvc\build\stage1\lib\clang\22\include' -nostdsysteminc -target-cpu z13 -triple s390x-linux-gnu -O2 -fzvector -flax-vector-conversions=none -Wall -Wno-unused -Werror -S 'C:\b\slave\clang-x64-windows-msvc\llvm-project\clang\test\CodeGen\SystemZ\builtins-systemz-zvector.c' -o -
# .---command stderr------------
# | fatal error: error in backend: Cannot select: intrinsic %llvm.s390.vpklsg
# | PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
# | Stack dump:
# | 0.	Program arguments: c:\\b\\slave\\clang-x64-windows-msvc\\build\\stage1\\bin\\clang.exe -cc1 -internal-isystem C:\\b\\slave\\clang-x64-windows-msvc\\build\\stage1\\lib\\clang\\22\\include -nostdsysteminc -target-cpu z13 -triple s390x-linux-gnu -O2 -fzvector -flax-vector-conversions=none -Wall -Wno-unused -Werror -S C:\\b\\slave\\clang-x64-windows-msvc\\llvm-project\\clang\\test\\CodeGen\\SystemZ\\builtins-systemz-zvector.c -o -
# | 1.	<eof> parser at end of file
# | 2.	Code generation
# | 3.	Running pass 'Function Pass Manager' on module 'C:\b\slave\clang-x64-windows-msvc\llvm-project\clang\test\CodeGen\SystemZ\builtins-systemz-zvector.c'.
# | 4.	Running pass 'SystemZ DAG->DAG Pattern Instruction Selection' on function '@test_core'
# `-----------------------------
# error: command failed with exit status: 70
# executed command: 'c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\filecheck.exe' 'C:\b\slave\clang-x64-windows-msvc\llvm-project\clang\test\CodeGen\SystemZ\builtins-systemz-zvector.c' --check-prefix=CHECK-ASM
# .---command stderr------------
# | FileCheck error: '<stdin>' is empty.
# | FileCheck command line:  c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\filecheck.exe C:\b\slave\clang-x64-windows-msvc\llvm-project\clang\test\CodeGen\SystemZ\builtins-systemz-zvector.c --check-prefix=CHECK-ASM
# `-----------------------------
# error: command failed with exit status: 2

--

********************
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 
FAIL: Clang :: CodeGen/SystemZ/builtins-systemz-zvector2.c (1889 of 22678)
******************** TEST 'Clang :: CodeGen/SystemZ/builtins-systemz-zvector2.c' FAILED ********************
Exit Code: 2

Step 8 (stage 1 check) failure: stage 1 check (failure)
...
[321/323] Running the Clang regression tests
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\config.py:57: note: using lit tools: C:\Program Files\Git\usr\bin
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\config.py:520: note: using clang: c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\clang.exe
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\subst.py:126: note: Did not find cir-opt in C:\b\slave\clang-x64-windows-msvc\build\stage1\bin;C:\b\slave\clang-x64-windows-msvc\build\stage1\bin
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\config.py:520: note: using ld.lld: c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\ld.lld.exe
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\config.py:520: note: using lld-link: c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\lld-link.exe
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\config.py:520: note: using ld64.lld: c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\ld64.lld.exe
llvm-lit.py: C:\b\slave\clang-x64-windows-msvc\llvm-project\llvm\utils\lit\lit\llvm\config.py:520: note: using wasm-ld: c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\wasm-ld.exe
-- Testing: 22678 tests, 32 workers --
Testing:  0.. 10.. 20.. 30.. 40
FAIL: Clang :: CodeGen/SystemZ/builtins-systemz-zvector.c (790 of 22678)
******************** TEST 'Clang :: CodeGen/SystemZ/builtins-systemz-zvector.c' FAILED ********************
Exit Code: 2

Command Output (stdout):
--
# RUN: at line 2
c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\clang.exe -cc1 -internal-isystem C:\b\slave\clang-x64-windows-msvc\build\stage1\lib\clang\22\include -nostdsysteminc -target-cpu z13 -triple s390x-linux-gnu  -O2 -fzvector -flax-vector-conversions=none  -Wall -Wno-unused -Werror -emit-llvm C:\b\slave\clang-x64-windows-msvc\llvm-project\clang\test\CodeGen\SystemZ\builtins-systemz-zvector.c -o - | c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\filecheck.exe C:\b\slave\clang-x64-windows-msvc\llvm-project\clang\test\CodeGen\SystemZ\builtins-systemz-zvector.c
# executed command: 'c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\clang.exe' -cc1 -internal-isystem 'C:\b\slave\clang-x64-windows-msvc\build\stage1\lib\clang\22\include' -nostdsysteminc -target-cpu z13 -triple s390x-linux-gnu -O2 -fzvector -flax-vector-conversions=none -Wall -Wno-unused -Werror -emit-llvm 'C:\b\slave\clang-x64-windows-msvc\llvm-project\clang\test\CodeGen\SystemZ\builtins-systemz-zvector.c' -o -
# executed command: 'c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\filecheck.exe' 'C:\b\slave\clang-x64-windows-msvc\llvm-project\clang\test\CodeGen\SystemZ\builtins-systemz-zvector.c'
# RUN: at line 5
c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\clang.exe -cc1 -internal-isystem C:\b\slave\clang-x64-windows-msvc\build\stage1\lib\clang\22\include -nostdsysteminc -target-cpu z13 -triple s390x-linux-gnu  -O2 -fzvector -flax-vector-conversions=none  -Wall -Wno-unused -Werror -S C:\b\slave\clang-x64-windows-msvc\llvm-project\clang\test\CodeGen\SystemZ\builtins-systemz-zvector.c -o - | c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\filecheck.exe C:\b\slave\clang-x64-windows-msvc\llvm-project\clang\test\CodeGen\SystemZ\builtins-systemz-zvector.c --check-prefix=CHECK-ASM
# executed command: 'c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\clang.exe' -cc1 -internal-isystem 'C:\b\slave\clang-x64-windows-msvc\build\stage1\lib\clang\22\include' -nostdsysteminc -target-cpu z13 -triple s390x-linux-gnu -O2 -fzvector -flax-vector-conversions=none -Wall -Wno-unused -Werror -S 'C:\b\slave\clang-x64-windows-msvc\llvm-project\clang\test\CodeGen\SystemZ\builtins-systemz-zvector.c' -o -
# .---command stderr------------
# | fatal error: error in backend: Cannot select: intrinsic %llvm.s390.vpklsg
# | PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
# | Stack dump:
# | 0.	Program arguments: c:\\b\\slave\\clang-x64-windows-msvc\\build\\stage1\\bin\\clang.exe -cc1 -internal-isystem C:\\b\\slave\\clang-x64-windows-msvc\\build\\stage1\\lib\\clang\\22\\include -nostdsysteminc -target-cpu z13 -triple s390x-linux-gnu -O2 -fzvector -flax-vector-conversions=none -Wall -Wno-unused -Werror -S C:\\b\\slave\\clang-x64-windows-msvc\\llvm-project\\clang\\test\\CodeGen\\SystemZ\\builtins-systemz-zvector.c -o -
# | 1.	<eof> parser at end of file
# | 2.	Code generation
# | 3.	Running pass 'Function Pass Manager' on module 'C:\b\slave\clang-x64-windows-msvc\llvm-project\clang\test\CodeGen\SystemZ\builtins-systemz-zvector.c'.
# | 4.	Running pass 'SystemZ DAG->DAG Pattern Instruction Selection' on function '@test_core'
# `-----------------------------
# error: command failed with exit status: 70
# executed command: 'c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\filecheck.exe' 'C:\b\slave\clang-x64-windows-msvc\llvm-project\clang\test\CodeGen\SystemZ\builtins-systemz-zvector.c' --check-prefix=CHECK-ASM
# .---command stderr------------
# | FileCheck error: '<stdin>' is empty.
# | FileCheck command line:  c:\b\slave\clang-x64-windows-msvc\build\stage1\bin\filecheck.exe C:\b\slave\clang-x64-windows-msvc\llvm-project\clang\test\CodeGen\SystemZ\builtins-systemz-zvector.c --check-prefix=CHECK-ASM
# `-----------------------------
# error: command failed with exit status: 2

--

********************
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 
FAIL: Clang :: CodeGen/SystemZ/builtins-systemz-zvector2.c (1889 of 22678)
******************** TEST 'Clang :: CodeGen/SystemZ/builtins-systemz-zvector2.c' FAILED ********************
Exit Code: 2

llvm-ci · 2025-07-19T13:45:39Z

LLVM Buildbot has detected a new failure on builder sanitizer-x86_64-linux-bootstrap-asan running on sanitizer-buildbot2 while building llvm at step 2 "annotate".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/52/builds/9796

Here is the relevant piece of the build log for the reference

Step 2 (annotate) failure: 'python ../sanitizer_buildbot/sanitizers/zorg/buildbot/builders/sanitizers/buildbot_selector.py' (failure)
...
llvm-lit: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/llvm/utils/lit/lit/llvm/config.py:520: note: using lld-link: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/lld-link
llvm-lit: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/llvm/utils/lit/lit/llvm/config.py:520: note: using ld64.lld: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/ld64.lld
llvm-lit: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/llvm/utils/lit/lit/llvm/config.py:520: note: using wasm-ld: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/wasm-ld
llvm-lit: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/llvm/utils/lit/lit/llvm/config.py:520: note: using ld.lld: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/ld.lld
llvm-lit: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/llvm/utils/lit/lit/llvm/config.py:520: note: using lld-link: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/lld-link
llvm-lit: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/llvm/utils/lit/lit/llvm/config.py:520: note: using ld64.lld: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/ld64.lld
llvm-lit: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/llvm/utils/lit/lit/llvm/config.py:520: note: using wasm-ld: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/wasm-ld
llvm-lit: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/llvm/utils/lit/lit/main.py:73: note: The test suite configuration requested an individual test timeout of 0 seconds but a timeout of 900 seconds was requested on the command line. Forcing timeout to be 900 seconds.
-- Testing: 91378 tests, 88 workers --
Testing:  0.. 10
FAIL: Clang :: Interpreter/pretty-print.cpp (13695 of 91378)
******************** TEST 'Clang :: Interpreter/pretty-print.cpp' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
/home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/clang-repl "int i = 10;" 'extern "C" int printf(const char*,...);'             'auto r1 = printf("i = %d\n", i);' | /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/FileCheck --check-prefix=CHECK-DRIVER /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/clang/test/Interpreter/pretty-print.cpp # RUN: at line 1
+ /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/clang-repl 'int i = 10;' 'extern "C" int printf(const char*,...);' 'auto r1 = printf("i = %d\n", i);'
+ /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/FileCheck --check-prefix=CHECK-DRIVER /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/clang/test/Interpreter/pretty-print.cpp
cat /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/clang/test/Interpreter/pretty-print.cpp | /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/clang-repl -Xcc -std=c++11 -Xcc -fno-delayed-template-parsing | /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/FileCheck /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/clang/test/Interpreter/pretty-print.cpp # RUN: at line 5
+ /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/clang-repl -Xcc -std=c++11 -Xcc -fno-delayed-template-parsing
+ cat /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/clang/test/Interpreter/pretty-print.cpp
+ /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/FileCheck /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/clang/test/Interpreter/pretty-print.cpp
JIT session error: In graph incr_module_21-jitted-objectbuffer, section .text.startup: relocation target 0x7aab39024044 (_ZN2S3D2Ev:0x7aab39024048 + 0xfffffffffffffffc) is out of range of Delta32 fixup at address 0x7aab39024048 (<anonymous block> @ 0x76ab37e0d010 + 0x18)
error: Failed to materialize symbols: { (main, { $.incr_module_21.__inits.0, s3, __orc_init_func.incr_module_21 }) }
JIT session error: Failed to materialize symbols: { (main, { s3 }) }
error: Failed to materialize symbols: { (main, { __orc_init_func.incr_module_22, $.incr_module_22.__inits.0 }) }
error: Failed to materialize symbols: { (main, { __orc_init_func.incr_module_21 }) }
/home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/clang/test/Interpreter/pretty-print.cpp:24:16: error: CHECK-NEXT: expected string not found in input
// CHECK-NEXT: (S3 &) @0x{{[0-9a-f]+}}
               ^
<stdin>:5:21: note: scanning from here
(S3) @0x76eb38040230
                    ^
<stdin>:10:5: note: possible intended match here
((lambda) &) @0x76ab36901000
    ^

Input file: <stdin>
Check file: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/clang/test/Interpreter/pretty-print.cpp

-dump-input=help explains the following input dump.

Input was:
<<<<<<
           1: (const char[3]) "ab" 
           2: (char[2]) { '1', 'a' } 
           3: (char[3]) "1a" 
           4: (char[2][3][1]) { { { 'a' }, { 'b' }, { 'c' } }, { { 'd' }, { 'e' }, { 'f' } } } 
Step 11 (stage2/asan check) failure: stage2/asan check (failure)
...
llvm-lit: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/llvm/utils/lit/lit/llvm/config.py:520: note: using lld-link: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/lld-link
llvm-lit: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/llvm/utils/lit/lit/llvm/config.py:520: note: using ld64.lld: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/ld64.lld
llvm-lit: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/llvm/utils/lit/lit/llvm/config.py:520: note: using wasm-ld: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/wasm-ld
llvm-lit: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/llvm/utils/lit/lit/llvm/config.py:520: note: using ld.lld: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/ld.lld
llvm-lit: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/llvm/utils/lit/lit/llvm/config.py:520: note: using lld-link: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/lld-link
llvm-lit: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/llvm/utils/lit/lit/llvm/config.py:520: note: using ld64.lld: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/ld64.lld
llvm-lit: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/llvm/utils/lit/lit/llvm/config.py:520: note: using wasm-ld: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/wasm-ld
llvm-lit: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/llvm/utils/lit/lit/main.py:73: note: The test suite configuration requested an individual test timeout of 0 seconds but a timeout of 900 seconds was requested on the command line. Forcing timeout to be 900 seconds.
-- Testing: 91378 tests, 88 workers --
Testing:  0.. 10
FAIL: Clang :: Interpreter/pretty-print.cpp (13695 of 91378)
******************** TEST 'Clang :: Interpreter/pretty-print.cpp' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
/home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/clang-repl "int i = 10;" 'extern "C" int printf(const char*,...);'             'auto r1 = printf("i = %d\n", i);' | /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/FileCheck --check-prefix=CHECK-DRIVER /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/clang/test/Interpreter/pretty-print.cpp # RUN: at line 1
+ /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/clang-repl 'int i = 10;' 'extern "C" int printf(const char*,...);' 'auto r1 = printf("i = %d\n", i);'
+ /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/FileCheck --check-prefix=CHECK-DRIVER /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/clang/test/Interpreter/pretty-print.cpp
cat /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/clang/test/Interpreter/pretty-print.cpp | /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/clang-repl -Xcc -std=c++11 -Xcc -fno-delayed-template-parsing | /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/FileCheck /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/clang/test/Interpreter/pretty-print.cpp # RUN: at line 5
+ /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/clang-repl -Xcc -std=c++11 -Xcc -fno-delayed-template-parsing
+ cat /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/clang/test/Interpreter/pretty-print.cpp
+ /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm_build_asan/bin/FileCheck /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/clang/test/Interpreter/pretty-print.cpp
JIT session error: In graph incr_module_21-jitted-objectbuffer, section .text.startup: relocation target 0x7aab39024044 (_ZN2S3D2Ev:0x7aab39024048 + 0xfffffffffffffffc) is out of range of Delta32 fixup at address 0x7aab39024048 (<anonymous block> @ 0x76ab37e0d010 + 0x18)
error: Failed to materialize symbols: { (main, { $.incr_module_21.__inits.0, s3, __orc_init_func.incr_module_21 }) }
JIT session error: Failed to materialize symbols: { (main, { s3 }) }
error: Failed to materialize symbols: { (main, { __orc_init_func.incr_module_22, $.incr_module_22.__inits.0 }) }
error: Failed to materialize symbols: { (main, { __orc_init_func.incr_module_21 }) }
/home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/clang/test/Interpreter/pretty-print.cpp:24:16: error: CHECK-NEXT: expected string not found in input
// CHECK-NEXT: (S3 &) @0x{{[0-9a-f]+}}
               ^
<stdin>:5:21: note: scanning from here
(S3) @0x76eb38040230
                    ^
<stdin>:10:5: note: possible intended match here
((lambda) &) @0x76ab36901000
    ^

Input file: <stdin>
Check file: /home/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/clang/test/Interpreter/pretty-print.cpp

-dump-input=help explains the following input dump.

Input was:
<<<<<<
           1: (const char[3]) "ab" 
           2: (char[2]) { '1', 'a' } 
           3: (char[3]) "1a" 
           4: (char[2][3][1]) { { { 'a' }, { 'b' }, { 'c' } }, { { 'd' }, { 'e' }, { 'f' } } }

llvmbot added backend:NVPTX llvm:ir labels Jul 15, 2025

durga4github requested a review from Artem-B July 15, 2025 15:11

Artem-B approved these changes Jul 17, 2025

View reviewed changes

llvm/lib/Target/NVPTX/NVPTXIntrinsics.td Outdated Show resolved Hide resolved

llvm/test/CodeGen/NVPTX/cp-async-bulk-tensor-g2s-cta-sm100.ll Show resolved Hide resolved

durga4github force-pushed the durgadossr/nvptx_tma_im2col_w branch from b942c3d to 994a8c6 Compare July 18, 2025 09:23

durga4github merged commit 3866e4e into llvm:main Jul 19, 2025
10 checks passed

durga4github deleted the durgadossr/nvptx_tma_im2col_w branch July 19, 2025 11:46

adurang mentioned this pull request Jul 23, 2025

test abhinavgaba/llvm-project#2

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[NVPTX] Add im2colw/w128 modes support to TMA intrinsics #148863

[NVPTX] Add im2colw/w128 modes support to TMA intrinsics #148863

Uh oh!

durga4github commented Jul 15, 2025 •

edited

Loading

Uh oh!

llvmbot commented Jul 15, 2025 •

edited

Loading

Uh oh!

durga4github commented Jul 17, 2025

Uh oh!

Artem-B left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

llvm-ci commented Jul 19, 2025

Uh oh!

llvm-ci commented Jul 19, 2025

Uh oh!

llvm-ci commented Jul 19, 2025

Uh oh!

Uh oh!

[NVPTX] Add im2colw/w128 modes support to TMA intrinsics #148863

[NVPTX] Add im2colw/w128 modes support to TMA intrinsics #148863

Uh oh!

Conversation

durga4github commented Jul 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented Jul 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

durga4github commented Jul 17, 2025

Uh oh!

Artem-B left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

llvm-ci commented Jul 19, 2025

Uh oh!

llvm-ci commented Jul 19, 2025

Uh oh!

llvm-ci commented Jul 19, 2025

Uh oh!

Uh oh!

durga4github commented Jul 15, 2025 •

edited

Loading

llvmbot commented Jul 15, 2025 •

edited

Loading